The dataset provides insights into user behavior on an online
shopping site. It features various types of data, including continuous,
binary, and categorical variables. These variables encompass the count
of administrative actions (Administrative), interactions
related to products (ProductRelated), and binary markers
indicating whether a visit took place on a weekend
(Weekend) or led to a purchase (Revenue).
Furthermore, it includes categorical information such as the user’s
browser (Browser), location (Region), and type
of visitor (VisitorType).
The primary objective of this analysis is to evaluate the dataset under several distance metrics and to determine which metric best suits each type of variable. By examining how observations relate to one another under each metric, the study aims to support a more reliable analysis of user behavior and the grouping of similar visits.
Distance metrics are essential in data science, particularly for clustering and classification. They assess how similar or different data points are and significantly affect the quality of outcomes. Choosing the right metric is crucial for achieving precise and valuable analyses, especially with mixed-type data.
Distance metrics are mathematical tools that measure how similar or different two data points are within a dataset. They play an important role in clustering, classification, and dimensionality reduction, because they define how "close" or "separated" two observations are in a feature space.
Many machine learning algorithms depend directly on a distance metric, which determines how data points are classified or clustered. Selecting an appropriate metric is therefore crucial for accurate and meaningful results, especially when the data mix several variable types.
For continuous variables, the following metrics are commonly used:
Euclidean Distance:
The simplest measure, giving the straight-line distance between two points in a multi-dimensional space.
Formula: \[ d(x, y) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2} \]
Use Case: Works well for normalized continuous data.
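As an illustration of the formula above, the following is a minimal Python sketch of Euclidean distance. The function name `euclidean_distance` and the two example vectors are made up for the example and are not part of the dataset.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Two hypothetical observations (illustrative values only).
point_a = [0.0, 0.0]
point_b = [3.0, 4.0]
print(euclidean_distance(point_a, point_b))  # 5.0
```

Because the differences are squared, a variable measured on a large scale can dominate the result, which is why the metric is recommended for normalized data.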
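Manhattan Distance:
Sums the absolute differences between coordinates, so each variable contributes linearly to the total.
Formula: \[ d(x, y) = \sum_{i=1}^n |x_i - y_i| \]
Use Case: Appropriate for lower-dimensional data, or when linear (absolute) differences matter more than squared ones.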
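Manhattan distance is the sum of the absolute coordinate differences between two points. A minimal Python sketch follows; the function name `manhattan_distance` and the example vectors are illustrative assumptions.

```python
def manhattan_distance(x, y):
    """Sum of absolute coordinate differences between two numeric vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# Same hypothetical points as a 3-4-5 right triangle: the "city block"
# path along the two legs is longer than the straight-line hypotenuse.
print(manhattan_distance([0.0, 0.0], [3.0, 4.0]))  # 7.0
```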
Canberra Distance:
Scales each difference by the magnitude of the values involved, so a given difference counts for more when the values themselves are close to zero.
Formula: \[ d(x, y) = \sum_{i=1}^n \frac{|x_i - y_i|}{|x_i| + |y_i|} \]
Use Case: Suited to datasets whose variables take values of small magnitude.
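The sketch below shows one way to compute the Canberra formula in Python. It assumes the common convention that a term with a zero denominator (both values equal to zero) contributes nothing to the sum. The function name and example values are illustrative.

```python
def canberra_distance(x, y):
    """Sum of |x_i - y_i| / (|x_i| + |y_i|) over all coordinates."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    total = 0.0
    for xi, yi in zip(x, y):
        denominator = abs(xi) + abs(yi)
        if denominator == 0:
            continue  # assumed convention: a 0/0 term is treated as 0
        total += abs(xi - yi) / denominator
    return total

# Both coordinates differ by exactly 1, but the first pair is near zero,
# so it contributes 1/3 while the second contributes only 1/201.
print(canberra_distance([1, 100], [2, 101]))
```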
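For binary variables, the following metrics are commonly used:
Simple Matching Distance:
Counts both kinds of agreement (1-1 and 0-0) as matches and treats every disagreement as a mismatch.
Formula: \[ d(x, y) = 1 - \frac{a + d}{p} \]
Here a is the number of variables on which both observations are 1, d (on the right-hand side) is the number on which both are 0, and p is the total number of binary variables.
Use Case: Suited to datasets in which a shared 0 is as informative as a shared 1.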
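The formula above matches the simple matching distance, where a + d is the number of positions on which two binary vectors agree and p is the vector length. A minimal Python sketch under that reading follows; the function name and example vectors are hypothetical.

```python
def simple_matching_distance(x, y):
    """1 minus the share of positions where two binary vectors agree."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    p = len(x)
    # a + d: positions where both are 1 plus positions where both are 0.
    matches = sum(1 for xi, yi in zip(x, y) if xi == yi)
    return 1 - matches / p

# Two made-up binary profiles that disagree in one of four positions.
print(simple_matching_distance([1, 0, 0, 1], [1, 0, 1, 1]))  # 0.25
```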
For categorical variables, the following metrics are effective:
For datasets with mixed variable types (continuous, binary, and categorical), a flexible metric is required:
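Gower's distance, which this analysis adopts for the MDS step, is one such metric. In its basic form it averages a per-variable dissimilarity: a range-scaled absolute difference for numeric variables, and a 0/1 mismatch indicator for binary and categorical variables. The sketch below is a simplified illustration under those assumptions (equal weights, binary variables treated symmetrically, no missing values). The column names come from the dataset, but the three records and their values are made up.

```python
def gower_distance(a, b, numeric_ranges):
    """Average per-feature dissimilarity between two records (dicts).

    Numeric features contribute |a - b| / range; every other feature
    contributes 0 on a match and 1 on a mismatch.
    """
    if a.keys() != b.keys():
        raise ValueError("records must share the same features")
    total = 0.0
    for name in a:
        if name in numeric_ranges:
            value_range = numeric_ranges[name]
            # A constant column carries no information, so it contributes 0.
            total += abs(a[name] - b[name]) / value_range if value_range else 0.0
        else:
            total += 0.0 if a[name] == b[name] else 1.0
    return total / len(a)

# Hypothetical records; values are illustrative, not taken from the data.
sessions = [
    {"Administrative": 0, "ProductRelated": 10, "Weekend": False, "VisitorType": "Returning"},
    {"Administrative": 4, "ProductRelated": 30, "Weekend": True, "VisitorType": "Returning"},
    {"Administrative": 2, "ProductRelated": 50, "Weekend": False, "VisitorType": "New"},
]

# Range (max - min) of each numeric column, computed from the sample.
numeric_ranges = {
    name: max(s[name] for s in sessions) - min(s[name] for s in sessions)
    for name in ("Administrative", "ProductRelated")
}

# (4/4 + 20/40 + 1 + 0) / 4
print(gower_distance(sessions[0], sessions[1], numeric_ranges))  # 0.625
```

Dividing by the range keeps every numeric contribution between 0 and 1, so no single variable type dominates the average.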
Analyzing different distance metrics gave important information about how the observations in the dataset are related. Each metric has its own strengths suited for certain types of data.
Although the individual metrics gave more insight into specific data types, the Gower Distance was chosen for the final Multidimensional Scaling (MDS) analysis. Because it handles continuous, binary, and categorical variables within a single measure, this choice lets every variable type contribute to the result, reflecting the mixed nature of the dataset and giving a more complete view of the underlying patterns.
Multidimensional Scaling (MDS) is a dimensionality-reduction technique that takes a matrix of pairwise distances and represents the observations as points in a low-dimensional space. Its aim is to keep the distances between pairs of points in the new space as close as possible to the original distances.
The steps are:
MDS is particularly useful for visualizing patterns, clusters, and relationships in high-dimensional data.
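One common way to carry this out is classical (Torgerson) MDS, sketched below with NumPy. This is an assumption for illustration, since the text does not state which MDS variant was used. The sketch double-centers the squared distance matrix, takes its eigendecomposition, and scales the leading eigenvectors by the square roots of their eigenvalues. The function names and the small distance matrix are hypothetical; in this analysis the input would be the Gower distance matrix.

```python
import numpy as np

def classical_mds(distances, k=2):
    """Embed n observations in k dimensions from an n x n distance matrix."""
    D = np.asarray(distances, dtype=float)
    n = D.shape[0]
    # Double-center the squared distances: B = -1/2 * J * D^2 * J.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # eigh returns eigenvalues in ascending order; sort them descending.
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Negative eigenvalues (possible for non-Euclidean distances) are clipped.
    scale = np.sqrt(np.clip(eigvals[:k], 0.0, None))
    return eigvecs[:, :k] * scale, eigvals

def cumulative_explained(eigvals):
    """Cumulative share of the positive eigenvalues, largest first."""
    positive = np.clip(eigvals, 0.0, None)
    return np.cumsum(positive) / positive.sum()

# Hypothetical distances between three observations (a 3-4-5 triangle).
D = np.array([[0.0, 3.0, 4.0],
              [3.0, 0.0, 5.0],
              [4.0, 5.0, 0.0]])
coords, eigvals = classical_mds(D, k=2)
print(coords)
print(cumulative_explained(eigvals))
```

The cumulative share of positive eigenvalues is one way to judge how many dimensions are worth keeping, in the spirit of a cumulative variance plot.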
Reduce the dimensions of a mixed-type dataset with MDS and Gower’s distance.
<div>
<img src="figures/cumulative_var_explained.png" alt="Cumulative Variance Explained" style="width:100%;">
<p style="text-align: center;">Figure 1: Cumulative Variance Explained</p>
</div>
<div>
<img src="figures/pc1_pc2_visualization.png" alt="MDS Configuration" style="width:100%;">
<p style="text-align: center;">Figure 2: MDS Configuration</p>
</div>